[Model] Add Index-AniSora I2V support (V1 5B + V2 14B) #877
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81f0eab187
| def __init__( | ||
| self, | ||
| *, | ||
| model_path: str = "Disty0/Index-anisora-5B-diffusers", | ||
| dtype: torch.dtype = torch.bfloat16, |
There was a problem hiding this comment.
Accept od_config in AniSora pipeline constructor
OmniDiffusion instantiates all registered diffusion models via initialize_model, which always calls model_class(od_config=od_config). This constructor only accepts model_path/dtype/device, so using AniSora through the normal Omni/Diffusers loader path will immediately raise a TypeError for the unexpected od_config kwarg and prevent the model from loading at all.
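A minimal sketch of the constructor shape this comment asks for, assuming the loader passes a single `od_config` keyword as described above. `FakeOmniDiffusionConfig`, `SketchPipeline`, and this `initialize_model` are hypothetical stand-ins, not the project's real classes or loader.

```python
from dataclasses import dataclass


@dataclass
class FakeOmniDiffusionConfig:
    """Hypothetical stand-in for the real config object; only `model` is modelled."""

    model: str = "Disty0/Index-anisora-5B-diffusers"


class SketchPipeline:
    """Toy pipeline whose constructor matches a `model_class(od_config=...)` call."""

    def __init__(self, *, od_config: FakeOmniDiffusionConfig) -> None:
        # Read the model path from the config instead of a `model_path` kwarg.
        self.model_path = od_config.model


def initialize_model(model_class, od_config):
    # Mirrors the call shape described in the review: one `od_config` keyword.
    return model_class(od_config=od_config)


pipeline = initialize_model(SketchPipeline, FakeOmniDiffusionConfig())
```

With this shape the loader call no longer hits an unexpected-keyword `TypeError`; dtype and device would presumably also be read from the config, but that is not shown in the thread.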
| def __call__( | ||
| self, | ||
| prompt: str | list[str], | ||
| image: PIL.Image.Image, |
There was a problem hiding this comment.
Provide forward(req) entry point for AniSora V2
The diffusion engine executes models via pipeline.forward(req) (with an OmniDiffusionRequest), but this class only defines __call__(prompt, image, ...) and never overrides forward. That means nn.Module.forward will raise NotImplementedError at runtime even if the model loads, so AniSora V2 cannot be run through Omni until a forward wrapper is added.
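One way to picture the missing wrapper is a `forward` that unpacks the request and delegates to the existing `__call__` signature. This is only a sketch on a plain class: `FakeRequest` and its field names are assumptions, and the real `OmniDiffusionRequest` layout is not shown in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for OmniDiffusionRequest; field names are assumptions."""

    prompt: str
    image: object = None
    extra: dict = field(default_factory=dict)


class SketchPipeline:
    def __call__(self, prompt, image, **kwargs):
        # Placeholder for the real sampling loop.
        return {"prompt": prompt, "image": image, **kwargs}

    def forward(self, req: FakeRequest):
        # Unpack the request and delegate to the existing __call__ signature.
        return self(req.prompt, req.image, **req.extra)


result = SketchPipeline().forward(FakeRequest(prompt="a cat", extra={"num_frames": 17}))
```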
d1c2809 to
537f736
Compare
lishunyang12
left a comment
There was a problem hiding this comment.
Thanks for your contributions. Amazing work, I will check it within the next two days.
|
I saw that you introduced new example files. Is it possible to reuse the scripts we already have? |
lishunyang12
left a comment
There was a problem hiding this comment.
@SamitHuang @ZJY0516 PTAL
|
Fix conflicts, thanks |
1c579ed to
b1182be
Compare
|
Thanks for the review! I rebased the PR on V2 (Wan2.1) updates:
V1 (CogVideoX) updates:
Shared updates (V1 + V2):
Testing:
|
There was a problem hiding this comment.
Pull request overview
This PR adds Index-AniSora image-to-video support to vLLM-Omni, covering both the CogVideoX-based 5B model and the Wan2.1-based 14B models, and wires them into the Omni diffusion registry and offline inference examples.
Changes:
- Extend `OmniDiffusion` initialization logic to infer AniSora V2/V3 Wan2.1-based pipelines when `model_index.json` is missing and only `config.json` is available.
- Register new AniSora pipelines (`AniSoraI2VCogVideoXPipeline` and `AniSoraV2I2VPipeline`) with corresponding pre-/post-processing hooks and implement their model loading, key-conversion, and I2V sampling logic.
- Update image-to-video examples and docs to describe AniSora usage and add the AniSora 5B pipeline to the supported models list.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| `vllm_omni/entrypoints/omni_diffusion.py` | Adds a FileNotFoundError guard when model_index.json is absent and introduces a special-case fallback that maps AniSora V2/V3 Wan2.1-based model IDs to the AniSoraV2I2VPipeline. |
| `vllm_omni/diffusion/registry.py` | Registers AniSoraV2I2VPipeline and AniSoraI2VCogVideoXPipeline with their pre-/post-process hooks, while also removing the FluxPipeline registry entries and the central sequence-parallelism hook. |
| `vllm_omni/diffusion/models/anisora/pipeline_anisora_v2_i2v.py` | Introduces the Wan2.1-based AniSora V2/V3 I2V pipeline with hybrid loading (Wan2.1 base components + AniSora transformer weights, including key-name conversion, VAE-based conditioning, and FlowUniPC sampling). |
| `vllm_omni/diffusion/models/anisora/pipeline_anisora_i2v_cogvideox.py` | Adds a CogVideoX-based AniSora 5B I2V pipeline using diffusers’ native CogVideoX components and implements image encoding, 3D rotary embeddings, and DDIM-based denoising. |
| `vllm_omni/diffusion/models/anisora/__init__.py` | Exposes the two new AniSora pipelines as part of the diffusion models package. |
| `examples/offline_inference/image_to_video/image_to_video.py` | Generalizes the example script description/usage to include AniSora 5B and 14B models alongside existing Wan2.2 I2V/TI2V models. |
| `examples/offline_inference/image_to_video/README.md` | Expands the image-to-video README with dedicated AniSora V1/V2 sections and example commands, plus reorganized Wan2.2 usage notes. |
| `docs/models/supported_models.md` | Updates the supported-models table to include AniSoraI2VCogVideoXPipeline and remove some previous rows (e.g., Flux, certain TTS entries). |
| `docs/.nav.yml` | Adds navigation entries for LoRA inference examples and several Omni connector design docs in the user guide and design sections. |
| # Simple test | ||
| if __name__ == "__main__": | ||
| import urllib.request | ||
| print("Testing AniSora I2V CogVideoX Pipeline...") | ||
| # Create pipeline | ||
| pipeline = AniSoraI2VCogVideoXPipeline( | ||
| model_path="Disty0/Index-anisora-5B-diffusers", | ||
| dtype=torch.bfloat16, | ||
| ) | ||
| pipeline.to("cuda") | ||
| # Download test image | ||
| url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png" | ||
| urllib.request.urlretrieve(url, "/tmp/cat.png") | ||
| image = PIL.Image.open("/tmp/cat.png").convert("RGB") | ||
| # Generate | ||
| output = pipeline( | ||
| prompt="a cat walking in the garden, high quality", | ||
| image=image, | ||
| negative_prompt="low quality, blurry", | ||
| num_inference_steps=10, | ||
| height=480, | ||
| width=832, | ||
| num_frames=17, | ||
| ) | ||
| print(f"Output type: {type(output)}") | ||
| print(f"Output.output shape: {output.output.shape}") | ||
| # Check for NaN | ||
| if torch.isnan(output.output).any(): | ||
| print("WARNING: Output contains NaN!") | ||
| else: | ||
| print("Output looks valid (no NaN)") | ||
| # Save video | ||
| from diffusers.utils import export_to_video | ||
| video = output.output[0].permute(1, 2, 3, 0).cpu().numpy() # [C, F, H, W] -> [F, H, W, C] | ||
| video = ((video + 1) / 2 * 255).clip(0, 255).astype("uint8") | ||
| export_to_video(video, "/workspace/test_cogvideox.mp4", fps=16) | ||
| print("Video saved to /workspace/test_cogvideox.mp4") |
There was a problem hiding this comment.
The if __name__ == "__main__" block instantiates AniSoraI2VCogVideoXPipeline(model_path=..., dtype=...), but the class __init__ only takes od_config (plus keyword-only) and doesn’t accept these arguments. This makes the in-file test unusable and may mislead users about how to construct the pipeline; it should either be removed or rewritten to go through OmniDiffusionConfig / the Omni entrypoint.
|
@dorhuri123 It seems that the first video has accuracy problems |
|
@ZJY0516 agreed, the first output looks clearly off (strong color inversion / desaturation compared to the input). I re-ran the exact same settings and got a cleaner output on my side (after all the changes that were made to suit the existing example file). Could you point to the specific behavior you want to treat as the “accuracy issue” (e.g., color inversion, identity drift, motion artifacts)? That would help me isolate whether it’s still a code path issue or just variability.
Same input/settings for both runs:
attempt 1: anisora_v1_demo.mp4
attempt 2: anisora_v1_demo.1.mp4 |
|
The shape and movement of the cat in the video don’t look quite right. Could you compare this with the official implementation to verify? @dorhuri123 |
|
@ZJY0516 I compared against the official Diffusers CogVideoXImageToVideoPipeline using the same input/settings on an RTX 6000 (Blackwell Server Edition). The output shows the same shape/motion characteristics as the vLLM run, so it looks like this is model behavior rather than an integration issue.
Baseline (official diffusers) commands + script used:
# env
python -m venv ~/anisora-diffusers
source ~/anisora-diffusers/bin/activate
pip install --upgrade pip
pip install diffusers==0.36.0 transformers accelerate safetensors huggingface_hub \
    sentencepiece tiktoken protobuf imageio imageio-ffmpeg
pip install --pre --upgrade torch --index-url https://download.pytorch.org/whl/nightly/cu128
# input image
wget -O /tmp/cat.png https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png
# run_diffusers_anisora_v1.py
import torch
import PIL.Image
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "Disty0/Index-anisora-5B-diffusers",
    torch_dtype=torch.bfloat16,
).to("cuda")
image = PIL.Image.open("/tmp/cat.png").convert("RGB")
video = pipe(
    prompt="A cat playing with yarn",
    image=image,
    height=480,
    width=720,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
    output_type="np",
).frames[0]
export_to_video(video, "anisora_v1_diffusers.mp4", fps=16)
I’ll attach the diffusers output video in this comment. If you’re seeing a specific artifact you want addressed, let me know the exact behavior and I’ll dig deeper.
anisora_v1_diffusers.mp4 |
|
resolve conflicts please |
Which transformers version are you using? |
|
Did you keep the seed the same for the comparison with diffusers? |
I didn't pin the |
Good catch — I didn't set a fixed seed in the diffusers baseline script. The comparison was qualitative, showing that the same motion/shape characteristics appear in both implementations. I can re-run with a fixed seed in both ( |
| |`StableDiffusion3Pipeline` | Stable-Diffusion-3 | `stabilityai/stable-diffusion-3.5-medium` | | ||
| |`Flux2KleinPipeline` | FLUX.2-klein | `black-forest-labs/FLUX.2-klein-4B`, `black-forest-labs/FLUX.2-klein-9B` | | ||
| |`FluxPipeline` | FLUX.1-dev | `black-forest-labs/FLUX.1-dev` | | ||
| |`StableAudioPipeline` | Stable-Audio-Open | `stabilityai/stable-audio-open-1.0` | |
There was a problem hiding this comment.
Looks like this diff might have accidentally deleted the FluxPipeline and Qwen3TTSForConditionalGeneration rows — probably a rebase artifact? Also, would it make sense to add the AniSora V2 (14B) entry to the table too?
There was a problem hiding this comment.
Good catch — this was indeed a rebase artifact. I've synced the GPU table with upstream main (restored FluxPipeline and the three Qwen3TTSForConditionalGeneration rows) and added the AniSora V2 (14B) entry as well.
There was a problem hiding this comment.
Makes sense, thanks for cleaning that up.
| "pipeline_anisora_i2v_cogvideox", | ||
| "AniSoraI2VCogVideoXPipeline", | ||
| ), | ||
| } |
There was a problem hiding this comment.
Just something I was wondering about — registering CogVideoXImageToVideoPipeline as the key means any model declaring that class name would get routed here, including vanilla CogVideoX I2V models. Would a more specific key work better, or is there a reason for keeping it generic?
There was a problem hiding this comment.
You're right — using the generic diffusers class name would hijack vanilla CogVideoX models. I've renamed the registry key to AniSoraI2VCogVideoXPipeline and added a targeted mapping in omni_diffusion.py that only converts CogVideoXImageToVideoPipeline → AniSoraI2VCogVideoXPipeline when "anisora" appears in the model name. This way vanilla CogVideoX models are unaffected.
There was a problem hiding this comment.
Much better — scoping by model name avoids the hijacking issue.
| # Load weights from AniSora | ||
| logger.info("Downloading AniSora weights...") | ||
| import glob | ||
| import os as os_module |
There was a problem hiding this comment.
Minor nit — os is already imported at the top of the file (line 21), so the import os as os_module here shadows it a bit. Not a big deal though.
There was a problem hiding this comment.
Fixed — removed the redundant import and switched both usages to the top-level os.
|
|
||
| # Load state dict | ||
| missing, unexpected = self.transformer.load_state_dict(converted_state_dict, strict=False) | ||
| if missing: |
There was a problem hiding this comment.
I noticed missing keys are only logged at debug level, which is off by default. Since a wrong key mapping could be tricky to debug, would it help to log at warning level or add a threshold check? Just a thought.
There was a problem hiding this comment.
Good point — a broken key mapping would be very hard to diagnose with debug-level logging. Changed both missing and unexpected keys to warning level, and removed the 10-key threshold so all keys are always logged. This way any mismatch is immediately visible.
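A rough illustration of the logging change described here, assuming the `missing` and `unexpected` lists come from `load_state_dict(..., strict=False)` as in the snippet above. The helper name `report_key_mismatch` and the logger name are made up for this sketch.

```python
import logging

logger = logging.getLogger("anisora_sketch")


def report_key_mismatch(missing: list, unexpected: list) -> int:
    """Log every mismatched key at WARNING level and return how many were reported."""
    for key in missing:
        logger.warning("Missing key in checkpoint: %s", key)
    for key in unexpected:
        logger.warning("Unexpected key in checkpoint: %s", key)
    return len(missing) + len(unexpected)


# Example with one hypothetical missing key and no unexpected keys.
count = report_key_mismatch(["blocks.0.attn1.to_q.weight"], [])
```

Because WARNING is shown under Python's default logging configuration, a broken key mapping surfaces without any extra flags.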
|
|
||
| # Classifier-free guidance | ||
| if do_classifier_free_guidance: | ||
| noise_uncond = self.transformer( |
There was a problem hiding this comment.
I noticed CFG here runs the transformer twice per step instead of batching conditional and unconditional together. The V1 pipeline does batch them with torch.cat. Is there a specific reason V2 does it differently, or could it use the same approach?
There was a problem hiding this comment.
No specific reason — this was an oversight. I've refactored V2 to batch conditional and unconditional inputs with torch.cat in a single forward pass, matching V1's approach. This halves the number of transformer calls per denoising step.
There was a problem hiding this comment.
Nice, batching should cut the per-step cost significantly.
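To make the call-count argument concrete, here is a toy sketch in plain Python. Lists stand in for tensors and list concatenation stands in for `torch.cat`; `toy_model` is a made-up placeholder for the transformer, and the unconditional-first ordering is an assumption rather than something stated in this thread.

```python
def toy_model(batch: list) -> list:
    """Stand-in for the transformer: one call processes a whole batch."""
    toy_model.calls += 1
    return [2.0 * x for x in batch]


toy_model.calls = 0


def cfg_step(cond: float, uncond: float, guidance_scale: float) -> float:
    # Batch unconditional and conditional inputs into a single forward call.
    noise_uncond, noise_cond = toy_model([uncond] + [cond])
    # Standard classifier-free guidance combination.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


guided = cfg_step(cond=1.0, uncond=0.5, guidance_scale=5.0)
```

After one `cfg_step`, `toy_model.calls` is 1 instead of the 2 calls an unbatched version would need, which is the saving discussed above.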
| if isinstance(first_prompt, dict): | ||
| additional_info = first_prompt.get("additional_information", {}) | ||
| if isinstance(additional_info, dict) and isinstance( | ||
| additional_info.get("preprocessed_image"), PIL.Image.Image |
There was a problem hiding this comment.
I might be misreading this, but video_processor.preprocess() returns a torch.Tensor, so the isinstance(..., PIL.Image.Image) check would always be False, making this branch unreachable. Is the intent to always go through multi_modal_data instead?
There was a problem hiding this comment.
You're right — preprocessed_image is always a tensor from VideoProcessor.preprocess(), so the PIL check was dead code. I've simplified the logic to go directly to multi_modal_data["image"] which holds the PIL image needed for CLIP conditioning.
| logger.info("Encoding prompts...") | ||
| prompt_embeds, negative_prompt_embeds = self.encode_prompt(prompt, negative_prompt) | ||
|
|
||
| do_classifier_free_guidance = guidance_scale > 1.0 and negative_prompt_embeds is not None |
There was a problem hiding this comment.
Just want to make sure — guidance_scale seems to only take effect when negative_prompt_embeds is provided, but I don't see a default negative prompt being set. Is that intentional, or should there be an empty-string default when guidance_scale > 1.0?
There was a problem hiding this comment.
Not intentional — without a default, CFG was silently becoming a no-op when no negative prompt was provided. I've added a default of "" (empty string) in both V1 and V2 when guidance_scale > 1.0 and no negative prompt is given. This matches the behavior of diffusers' official pipelines.
There was a problem hiding this comment.
Good fix — matching diffusers' default behavior seems right.
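The fallback described in this exchange can be sketched as a small helper. The function names are hypothetical; only the rule itself (empty-string default when `guidance_scale > 1.0` and no negative prompt is given) comes from the comment above.

```python
from typing import Optional


def resolve_negative_prompt(negative_prompt: Optional[str], guidance_scale: float) -> Optional[str]:
    """Fall back to an empty negative prompt when CFG is requested without one."""
    if negative_prompt is None and guidance_scale > 1.0:
        return ""
    return negative_prompt


def uses_cfg(negative_prompt: Optional[str], guidance_scale: float) -> bool:
    # CFG is active only when the scale exceeds 1.0 and a negative prompt exists.
    resolved = resolve_negative_prompt(negative_prompt, guidance_scale)
    return guidance_scale > 1.0 and resolved is not None
```

With the fallback in place, `uses_cfg(None, 5.0)` is true, so CFG no longer silently turns into a no-op.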
|
|
||
| # Default paths for components | ||
| DEFAULT_WAN_BASE = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers" | ||
| DEFAULT_ANISORA_TRANSFORMER = "aardsoul-music/Wan2.1-Anisora-14B" |
There was a problem hiding this comment.
Small thing — DEFAULT_ANISORA_TRANSFORMER doesn't seem to be used anywhere. Is it planned for future use, or can it be removed?
There was a problem hiding this comment.
No plans for it — the transformer path always comes from od_config.model. Removed.
| model_id = (od_config.model or "").lower() | ||
| if ( | ||
| od_config.model_class_name is None | ||
| and "anisora" in model_id |
There was a problem hiding this comment.
The "anisora" in model_id check might be a bit fragile — someone with a path like /data/anisora_experiment/some-other-model could accidentally match. Would a config-based check be more reliable? Just a thought.
There was a problem hiding this comment.
Good point. I've changed both AniSora detection paths to check os.path.basename() of the model path/ID instead of the full string, so only the actual model name is matched. A fully config-based approach would require changes upstream (e.g., a field in config.json), so basename matching is the best we can do for now since these community repos don't ship model_index.json.
There was a problem hiding this comment.
Yeah, basename matching sounds like a reasonable middle ground for now.
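A small sketch of why the basename check is narrower than matching on the full string. The helper name `looks_like_anisora` is an assumption; the expression inside it follows the `os.path.basename(...rstrip("/")).lower()` form quoted later in this thread.

```python
import os
from typing import Optional


def looks_like_anisora(model: Optional[str]) -> bool:
    """Match 'anisora' only in the last path component of a model path or ID."""
    name = os.path.basename((model or "").rstrip("/"))
    return "anisora" in name.lower()
```

A path such as `/data/anisora_experiment/some-other-model` no longer matches, while `aardsoul-music/Wan2.1-Anisora-14B` still does. A model whose own directory name happens to contain "anisora" would still match, which is the remaining heuristic noted by the reviewers.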
|
|
||
| class AniSoraI2VCogVideoXPipeline(nn.Module): | ||
| # vLLM uses this flag to decide whether to feed dummy images in warmup | ||
| support_image_input = True |
There was a problem hiding this comment.
Minor thing — the class docstring is after support_image_input = True, so Python would attach it to that attribute rather than the class. Might want to move it up?
There was a problem hiding this comment.
Fixed — moved the docstring to the first statement in the class body (before support_image_input) in both V1 and V2 pipelines.
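For reference, the effect of docstring placement is easy to check in isolation. In this sketch (class names are illustrative only), the string placed after the attribute is just a bare expression at runtime, so the class ends up with no `__doc__`.

```python
class DocstringAfterAttribute:
    support_image_input = True
    """This string is a bare expression, not the class docstring."""


class DocstringFirst:
    """This string is the class docstring."""

    support_image_input = True
```

Some documentation tools treat a string after an assignment as an attribute docstring, but `help()` and `__doc__` only see the second form.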
|
@vllm-omni-reviewer |
🤖 VLLM-Omni PR Review
Code Review: Add Index-AniSora I2V Support
1. Overview
This PR adds support for Index-AniSora Image-to-Video models, supporting both the 5B (CogVideoX-based) and 14B (Wan2.1-based) variants. The implementation includes:
Overall Assessment: LGTM with suggestions - The implementation is well-structured and follows existing patterns, but has several issues that should be addressed before merging.
2. Code Quality
Positive Aspects
Issues Found
Critical: Incorrect ValueError usage
raise ValueError(
    """No image is provided. This model requires an image to run.""",
    """Please correctly set `"multi_modal_data": {"image": <an image object or file path>, …}`""",
)
This raises
Fix:
raise ValueError(
    "No image is provided. This model requires an image to run. "
    "Please correctly set `multi_modal_data: {image: <an image object or file path>, …}`"
)
Potential Bug: Duplicate weight loading
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    """Load weights using AutoWeightsLoader for vLLM integration."""
    loader = AutoWeightsLoader(self)
    return loader.load_weights(weights)
The V2 pipeline loads weights manually in
Suggestion: Either remove
3. Architecture & Design
Positive Aspects
Concerns
Fragile model detection logic
if (
    class_name == "CogVideoXImageToVideoPipeline"
    and "anisora" in os.path.basename((od_config.model or "").rstrip("/")).lower()
):
    class_name = "AniSoraI2VCogVideoXPipeline"
This string matching could match unintended models (e.g.,
Forced download in V2 pipeline
if local_anisora:
    weight_path = model_path
else:
    weight_path = snapshot_download(model_path, local_files_only=False)
The
weight_path = snapshot_download(model_path, local_files_only=local_anisora)
Missing offline mode support for Wan base
The Wan2.1 base components are loaded with
4. Security & Safety
Input Validation
Missing path validation
safetensor_files = glob.glob(os.path.join(weight_path, "*.safetensors"))
if not safetensor_files:
    safetensor_files = glob.glob(os.path.join(weight_path, "**/*.safetensors"), recursive=True)
The glob patterns could potentially match unintended files. Consider:
Silent failure for missing CLIP encoder
except Exception as e:
    logger.warning("CLIP image encoder not available: %s", e)
    self.image_processor = None
    self.image_encoder = None
    self.has_image_encoder = False
Catching broad
5. Testing & Documentation
Documentation
Missing
Suggested Test Cases
6. Specific Suggestions
|
| Line | Issue | Suggestion |
|---|---|---|
| 79-82 | ValueError with multiple args | Combine into single string |
| 85-88 | ValueError with multiple args | Combine into single string |
| 368-370 | load_weights may conflict with init loading | Document that this is for vLLM internal use only |
pipeline_anisora_v2_i2v.py
| Line | Issue | Suggestion |
|---|---|---|
| 77-84 | ValueError with multiple args | Combine into single string |
| 167-175 | Offline mode not properly supported | Add local_files_only parameter to config |
| 196-200 | Forced network access | Use local_files_only=local_anisora |
| 248-252 | Duplicate weight loading risk | Remove or document behavior clearly |
omni_diffusion.py
| Line | Issue | Suggestion |
|---|---|---|
| 71-77 | Fragile string matching | Add additional validation or document naming requirements |
| 88-97 | Fragile string matching | Same as above |
registry.py
| Line | Issue | Suggestion |
|---|---|---|
| 289-290 | Post-process func names | Consider adding docstrings explaining the func signatures |
7. Approval Status
LGTM with suggestions
The PR is well-structured and follows existing patterns in the codebase. The hybrid loading approach for V2 is necessary and well-implemented. However, the following should be addressed:
Required before merge:
1. Fix the ValueError multi-argument issue (affects error messages displayed to users)
Recommended:
2. Fix the local_files_only=False forced download in V2 pipeline
3. Add basic unit tests for key conversion logic
4. Document the model naming convention requirement for auto-detection
Optional improvements:
5. Consider making CLIP encoder required for V2 I2V
6. Add validation for the load_weights vs __init__ weight loading in V2
This review was generated automatically by the VLLM-Omni PR Reviewer Bot using glm-5.
lishunyang12
left a comment
There was a problem hiding this comment.
All previous comments addressed — the CFG batching, registry rename, logging, and docstring fixes all look good. The only remaining heuristic is the basename check in omni_diffusion.py, but that's a reasonable approach for now. LGTM.
|
Thanks for the thorough review and the LGTM! I really appreciate you taking the time to go through everything. The @vllm-omni-reviewer bot flagged a few additional items — most were false positives or already covered, but it did catch a real bug: our ValueError calls were passing two string arguments (creating a tuple message) instead of a single concatenated string. Just pushed a fix for that in 57e3eb0. All feedback addressed — ready for merge whenever you're comfortable! |
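The tuple-message behaviour mentioned here can be reproduced in a few lines. The message strings below are shortened for illustration and are not the exact text from the pipeline.

```python
# Two positional arguments: the exception's args is a 2-tuple.
two_args = ValueError("No image is provided.", "Please set multi_modal_data.")

# Adjacent string literals are concatenated into a single argument.
one_arg = ValueError("No image is provided. " "Please set multi_modal_data.")

two_args_message = str(two_args)
one_arg_message = str(one_arg)
```

`two_args_message` renders as the tuple `('No image is provided.', 'Please set multi_modal_data.')`, whereas `one_arg_message` is the plain sentence a user would expect to read.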
|
@wtomin PTAL |
beaf8af to
7e9fd7d
Compare
PR Update: Benchmarks, E2E Tests, SP Support & V2 TP Bug Report
What this PR adds
Benchmark Results (2× H100 80GB)
E2E Test Results
Offline inference, 5/5 passed (159s):
Online serving, 2/2 passed (116s):
Known Issue: V2 (Wan2.1 14B) Quality Degradation with TP=2
We discovered a pre-existing bug in the Wan2.1 transformer's TP=2 weight sharding that causes severe mosaic artifacts. This is NOT introduced by this PR; it exists in the base Wan2.1 TP implementation. All 1143/1143 transformer weights load correctly (verified with diagnostics).
V2 with TP=1 (correct output): v2_sample.1.mp4
V2 with TP=2 (mosaic artifacts): v2_sample.mp4
This should be tracked as a separate issue for the Wan2.1 transformer TP implementation. |
| @@ -0,0 +1,197 @@ | |||
| # SPDX-License-Identifier: Apache-2.0 | |||
There was a problem hiding this comment.
Please take this RFC #1832 as reference of your online serving test script. You can test TP only now.
There was a problem hiding this comment.
Done — added test_anisora_v1_online_tp2_create_poll_download_delete (full job lifecycle with --tensor-parallel-size 2), following your guidance from #1832.
There was a problem hiding this comment.
The name of this file has a minor mismatch. Please check #1682 as a reference.
There was a problem hiding this comment.
In test-nightly-diffusion.yaml, the online serving tests are launched by:
pytest -s -v tests/e2e/online_serving/test_*_expansion.py
Therefore, I recommend renaming this test script to test_anisora_expansion.py.
Afterwards, I can add a nightly-test label, and launch a buildkite test with this model's online serving test.
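A quick way to check whether a file name would be picked up by the `test_*_expansion.py` pattern quoted above is Python's `fnmatch`. This approximates the shell glob used in the pytest command line; the helper name is made up for the sketch.

```python
from fnmatch import fnmatch

# Pattern taken from the nightly pytest invocation quoted in this comment.
PATTERN = "test_*_expansion.py"


def picked_up_by_nightly(filename: str) -> bool:
    """Return True if the file name matches the nightly online-serving glob."""
    return fnmatch(filename, PATTERN)
```

Under this check `test_anisora_expansion.py` matches and `test_anisora_online.py` does not, which is the reason for the rename request.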
|
An existing issue related to tp accuracy problem #1713. Please check if it is the same problem. Besides, since it supports FP8, please update |
8402714 to
241d9c6
Compare
|
@wtomin Our offline tests pass Also updated |
| # SPDX-FileCopyrightText: Copyright contributors to the vLLM project | ||
| """ | ||
| E2E offline inference tests for Index-AniSora I2V models. | ||
|
|
There was a problem hiding this comment.
To check functionality, we prioritize the online serving test script over the offline inference script. If your test cases overlap between the two scripts, I recommend keeping the test case (e.g., tp=2) in the online serving test script and deleting it from the offline inference test script. This prevents duplicated test cases.
There was a problem hiding this comment.
Removed test_anisora_v1_offline_tp2 and test_anisora_v2_offline_tp2 from the offline test file. TP=2 lifecycle coverage is now maintained only in test_anisora_online.py via test_anisora_v1_online_tp2_create_poll_download_delete as recommended.
|
Please resolve the conflicts. |
95f995e to
aeace8e
Compare
|
Conflicts resolved and rebased on latest main. Picked up upstream's doc restructure (acceleration docs merged into |
aeace8e to
1edfb06
Compare
|
@wtomin Rebased on latest main and resolved the conflicts (updated the new diffusion_features.md with AniSora V1/V2 rows and kept the registry merged). Could you take another look when you get a chance? Thanks! |
1edfb06 to
768ad3a
Compare
|
@dorhuri123 Sorry for the delay. Could you rebase onto the latest main? I will try to merge this PR soon. |
…nchmarks
- Add AniSora V1 (5B, CogVideoX-based) I2V pipeline with Ulysses SP, TP, and FP8 quantization support
- Add AniSora V2 (14B, Wan2.1-based) I2V pipeline with AniSora→diffusers weight key conversion and TP support
- Register both pipelines and their pre/post-process hooks in the diffusion registry; route via OmniDiffusion entrypoint
- Add e2e offline tests (single GPU, SP=2, FP8) and online serving tests (V1 single GPU, V1 TP=2, V2) covering full job lifecycle
- Add AniSora rows to `docs/models/supported_models.md` and the new VideoGen feature table in `docs/user_guide/diffusion_features.md`
Signed-off-by: Dor Huri <Dorhuri123@gmail.com>
768ad3a to
043d1b0
Compare
|
@wtomin Thanks! Rebased onto latest main and resolved the conflicts, now a single clean commit, ready to merge. |
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Summary
This PR adds support for Index-AniSora Image-to-Video models, a family of anime-optimized video generation models developed by Bilibili. Supports both the 5B (CogVideoX-based) and 14B (Wan2.1-based) variants.
Closes #670
Supported Models
- IndexTeam/AniSora-v1-i2v-diffusers
- aardsoul-music/Wan2.1-Anisora-14B
Demo Results
AniSora V1 (5B) - RTX 6000
Input Image:
Generation Settings:
"A cat playing with yarn"
Output Video (5.06 seconds):
anisora_v1_demo_gh.mp4
AniSora V2 (14B) - Short - NVIDIA H200
Input Image:
Generation Settings:
"a panda eating bamboo, natural lighting, detailed fur"
Output Video (2.1 seconds):
anisora_v2_output_gh.mp4
AniSora V2 (14B) - Long - NVIDIA H200
Input Image:
Generation Settings:
"a woman smiling gently, soft natural lighting, cinematic quality, subtle head movement, flowing hair"
Output Video (6.1 seconds):
anisora_v2_long.mp4
Usage
V1 (5B)
python examples/offline_inference/image_to_video/anisora_image_to_video.py \
  --model IndexTeam/AniSora-v1-i2v-diffusers \
  --image input.png \
  --prompt "anime scene, smooth motion" \
  --height 480 \
  --width 720 \
  --num_frames 81 \
  --guidance_scale 5.0 \
  --num_inference_steps 50 \
  --fps 16 \
  --output anisora_v1.mp4
V2/V3 (14B)
python examples/offline_inference/image_to_video/anisora_v2_image_to_video.py \
  --image input.png \
  --prompt "anime scene, high quality animation" \
  --height 480 \
  --width 832 \
  --num-frames 49 \
  --guidance-scale 5.0 \
  --num-inference-steps 30 \
  --fps 8 \
  --output anisora_v2.mp4
Changes
New Files
- vllm_omni/diffusion/models/anisora/ - AniSora pipeline module
  - pipeline_anisora_i2v_cogvideox.py - V1 (5B) CogVideoX-based pipeline
  - pipeline_anisora_v2_i2v.py - V2/V3 (14B) Wan2.1-based pipeline with hybrid loading
  - __init__.py - Module exports
- examples/offline_inference/image_to_video/anisora_image_to_video.py - V1 CLI example
- examples/offline_inference/image_to_video/anisora_v2_image_to_video.py - V2 CLI example
Modified Files
- examples/offline_inference/image_to_video/README.md - Added AniSora documentation
- vllm_omni/diffusion/registry.py - Register AniSora V1/V2 pipelines and their post-/pre-process hooks
Technical Notes
V2 Hybrid Loading
The V2 pipeline uses a hybrid loading approach because community-converted AniSora weights use different config/naming:
- Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
Key Name Conversions
Community AniSora weights use different naming conventions:
- self_attn → attn1
- cross_attn → attn2
- ffn → ff
- k → to_k, q → to_q, v → to_v, o → to_out.0
- modulation → scale_shift_table
Testing
Both pipelines produce output with proper animation.
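As an illustration of the key name conversions listed under Technical Notes, a per-segment rename could look roughly like the sketch below. The mapping tables come from that list; the dotted key layout (for example `blocks.0.self_attn.q.weight`) and the rule that `q`/`k`/`v`/`o` are renamed only directly under an attention module are assumptions, and the real conversion in the pipeline may handle more cases.

```python
# Renames taken from the "Key Name Conversions" list in the PR description.
SEGMENT_RENAMES = {
    "self_attn": "attn1",
    "cross_attn": "attn2",
    "ffn": "ff",
    "modulation": "scale_shift_table",
}
PROJECTION_RENAMES = {"q": "to_q", "k": "to_k", "v": "to_v", "o": "to_out.0"}


def convert_key(key: str) -> str:
    """Convert one AniSora-style state-dict key to the diffusers-style name."""
    converted = []
    inside_attention = False
    for segment in key.split("."):
        if inside_attention and segment in PROJECTION_RENAMES:
            # Projection names are renamed only directly under an attention module.
            converted.append(PROJECTION_RENAMES[segment])
        else:
            converted.append(SEGMENT_RENAMES.get(segment, segment))
        inside_attention = segment in ("self_attn", "cross_attn")
    return ".".join(converted)


example = convert_key("blocks.0.self_attn.q.weight")
```

Keeping the renames in two small tables makes the mapping easy to unit test, which is what the bot review recommends for the key conversion logic.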